1 Introduction to ChIP-seq

1.1 What is ChIP-seq?

Chromatin immunoprecipitation experiments followed by sequencing (ChIP–seq) detect protein–DNA binding events and chemical modifications of histone proteins (Figure 1).

Comparison of ChIP-seq experimental protocols (adapted from Furey 2012, Nature Reviews).

Comparison of ChIP-seq experimental protocols (adapted from Furey 2012, Nature Reviews).

Profiles obtained from transcription factor (TF) ChIP-seq and histone ChIP-seq are different (Figure 2):

  • TF ChIP-seq detects enrichment at the cis-regulatory element (CRE), where the TFs bind.
  • Histone ChIP-seq detects enrichment at the sorrounding nucleosomes, where histone residues are modified.
The transcription factor and histone modification landscape for inactive and active cis-regulatory elements (CRE; promoters and enhancers) and the corresponding read profiles (from Pundhir et al. 2015, Frontiers in Genetics).

The transcription factor and histone modification landscape for inactive and active cis-regulatory elements (CRE; promoters and enhancers) and the corresponding read profiles (from Pundhir et al. 2015, Frontiers in Genetics).

When sequencing ChIP-seq samples, we usually need a control for the antibody inespecific binding, known as input. We can use as input bulk DNA or DNA immunoprecipitated with IgG.

1.2 Sequencing parameters

You have to consider that when sequencing ChIP-seq, many of the parameters you can choose depend on the specifics of your experiment, your question and your budget! However, to provide you with some general guidelines, this are some general recommendations:

  • Paired or single-end. Usually single-end reads is sufficient for ChIP-seq analysis.
  • Sequencing depth. Between 30M (for narrow peaks) and 60M (for broad marks) sequenced reads should yield good profiles.
  • Read length. 50bp reads are usually sufficient. Could be interesting to make them longer when dealing with broader regions, but not that necessary when regions are narrow.

2 ChIP-seq Workflow

2.1 The dataset

The datasets we are going to use for this workshop are from Hlady et al., 2019 Hepatology, in which they compare several assays in normal liver with cirrotic liver and hepatocellular carcinoma (HCC) samples from human donors.

All raw data from their assays can be accessed via their GEO ID: GSE112221. The Gene Expression Omnibus (GEO) is a public functional genomics data repository that accepts array- and sequence-based data. Each dataset is given a GEO ID that usually starts with GSE. Inside each dataset, you can find independent samples with identifiers prefixed GSM.

In GEO you can download already processed data, such as bed files. However, we here are interested in downloading the raw fastq files so we can do all the processing ourselves. Therefore, for each sample we are interested in, we need to find the SRA ID (SRX, you can find it in the sample page in GEO) and from there, gather the run ID (SRR) which will give us access to the raw data.

Go to the dataset website at GEO (here). Look around and, using what is explained above, find the run ID for the normal liver sample of the H3K27ac ChIP-seq.
  1. Go to the section Samples (55) and click on + More.
  2. Find the sample named NL1_h3k27ac_h3k27ac_chipseq and click on the GEO sample id (GSM3061124).
  3. Find the section named Relations and then click on the SRA ID (SRX3832997)
  4. On the bottom you will see a table named Runs that includes all the runs for that sample.
  5. The run ID for the normal liver H3K27ac sample is SRR6880492

You can download the files directly from the SRA website by clicking on the run ID, but you can also do so from the terminal using the SRA toolkit tool fastq-dump.

With fastq-dump you will be able to download fastq files directly from SRA. Since these samples are paired-end, you need to include the argument --split-3 to send each pair to a different file. --gzip will compress the fastq output to fastq.gz.

# Download fastq files from SRA database
fastq-dump --split-3 --gzip -O your_folder/ SRR6880492 

Here you have a complete list of the samples we are going to use in this workshop and their SRA run IDs:

Sample names SRA run ID Sequenced reads Description
NL1_h3k27ac SRR6880492 19,808,136 H3K27ac ChIP-seq in Normal Liver
Cirr1_h3k27ac SRR6880493 26,149,171 H3K27ac ChIP-seq in Cirrotic Liver (Patient 1)
HCC1_h3k27ac SRR6880494 18,789,828 H3K27ac ChIP-seq in Hepatocellular Carcinoma (Patient 1)
Cirr3_h3k27ac SRR6880495 25,896,625 H3K27ac ChIP-seq in Cirrotic Liver (Patient 3)
HCC3_h3k27ac SRR6880496 21,954,923 H3K27ac ChIP-seq in Hepatocellular Carcinoma (Patient 3)
NL1_input SRR6880507 35,050,237 Input ChIP-seq in Normal Liver
Cirr1_input SRR6880509 24,301,177 Input ChIP-seq in Cirrotic Liver (Patient 1)
HCC1_input SRR6880511 27,006,532 Input ChIP-seq in Hepatocellular Carcinoma (Patient 1)
Cirr3_input SRR6880513 36,198,135 Input ChIP-seq in Cirrotic Liver (Patient 3)
HCC3_input SRR6880515 14,131,132 Input ChIP-seq in Hepatocellular Carcinoma (Patient 3)

2.2 General workflow

A general workflow for ChIP-seq analysis consists on the following steps:

  1. Qualiy control. Check if reads have good overall quality. Trim adapters and reads if approppriate.
  2. Alignment. Align reads to a reference genome to obtain their genomic coordiantes.
  3. Post-processing. Remove duplicates and unwanted chromosomes.
  4. Peak calling. Find regions of enrichment (TF binding sites or nucleosomes with the histone mark of interest).
  5. Visualization. Visualize profile in a genome browser.

2.3 Some technical details

3 Quality control

Before starting any processing steps, one should check the overall quality of the reads obtained from the sequencing facility. To do so, we are going to use FastQC, a quality control tool for high-troughput sequencing data.

fastqc folder_with_fastq/sample_reads.fastq

After checking that the overall quality is good, one can trim reads and/or filter out reads with low overall quality using different tools1.

Now run fastqc for the control liver sample.
fastqc fastq/NL1_H3K27ac.fastq

4 Alignment

We call alignment (or mapping) to the process of comparing reads to a reference genome, scoring the similarity and then assigning most likely genomic coordinates to each read.

4.1 Reference genomes

A reference genome is a database of DNA sequences, assembled as a representative of a species genome. Usually, one can find two versions of the reference genome: NCBI (starts with GRC) and UCSC (hg for human, mm for mice). The difference between the two basically lies in the naming of the chromosomes: NCBI uses numbers (i.e. 1, 2, X) and UCSC numbers plus chr (i.e. chr1, chr2, chrX).

Nowadays, the most used versions of the human genome are the following:

  • hg19/GRCh37. Even though this is not the latest version, is the most broadly used. If you do not specifically need high accuracy when mapping your reads, you should use this, as many tools (such as GREAT) will only work with this version.
  • hg38/GRch38. This is the newest version. Even though it was relased in 2013, many labs still use the previous version.

There are tools to convert coordinates from one build to another, such as liftOver from UCSC.

4.2 Aligners

Aligners are tools specifically designed to map reads to a reference genome using different algorithms. Many aligners are available, but the most used for aligning ChIP-seq data are:

  • Bowtie2. This is the one we will use in this workshop.
  • BWA.

Both tools allow alignment of single and paired end reads, you just have to specify it when running them. Generating an index is a techinque that many aligners use to speed up their running type. You may think about it as a book index: when having an idex it is then easier and faster to find specific parts of the book (i.e. the reference genome). Paper describing indexing methods

The output of the aligner is a SAM/BAM file containing the read sequence and the assigned coordinates respect to the reference genome. The aligner will report the number of aligned reads (which should be >80% of the sequenced reads) and which of those where uniquely mapped or multi-mapping (they match more than one region in the genome). Aligners such as bowtie2 report the best alginment for multi-mapping reads.

Many papers have tried to compare the performance of the different aligners available. However, my advice to you is to look in the literature for papers analyzing data similar to yours and just copy their methods!

4.3 SAM/BAM files

SAM (Sequence Alignment/Map) and BAM (same as SAM but binary, compressed and indexed) are file types used for storing sequence and alignment information.

Can you tell which one of the folloing files is in SAM and which one in BAM format?

File 1:

�BC�W[�Y�gbb&J��“����έ��q�鮙4۷t�
��P�m&=�m�z���� >+���ºB\(���|}�U�}[��U�._4�����T����������������/�WN?p�Bem��l�n��g��`4�.�'�����z������ �t��� !�I�\)��’1�M#0��@
3�>�”�0��/æ�R��P�T�D ��L&���#�1BM������P2��p��`I$�I}�M�D^t�P�h2f�cFɠ�ό�

File 2:

@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
  • File 1 is a BAM file. As BAM files are binary and compressed, they are not readable.
  • File 2 is a SAM file. SAM files are easily readable but their size is much bigger than BAM files!

BAM/SAM files consist on two different parts:

  • Header: starts with @XX and contains information on the file itself: how it is ordered, which is the reference genome used, which program you used to align the reads, etc.
  • Body: contains information on the sequences, their genomic coordinates, mapping qualities, etc.

You can find more information on all the fields present in the SAM/BAM files from the Samtools documentation.

Can you tell which lines are the header and which lines are the body of this SAM file?

@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
  • The first two lines starting with @HD and @SQ are the header.
    • The first line (HD) is telling you the format version (VN:1.6) and that the file is sorted by coordinates (SO:coordinate).
    • The second line (SQ) contains information on the reference genome used for the alignment: the name of the reference genome (SN:ref) and it length (LN:45).
  • The following lines contain infomation the each one of the reads aligned to the reference genome:
    1. Read name.
    2. Flag. Bitwise information on the read status (mapped, unmapped, paired, unpaired, etc.).
    3. Reference sequence name.
    4. Position in the sequence
    5. Mapping quality (MAPQ)
    6. […]
[…] You can find more information on all the fields present in the SAM/BAM files from the Samtools documentation.

4.4 Hands on! Let’s align this thing!

Step 1: Download and index the reference genome.

Before running an alignment, you will have to download the reference genome you want to use and index it.

For downloading hg19 you can go to UCSC downloads > Human > hg19 > Full data set > hg19.2bit. 2bit is a compression that UCSC uses for making fasta files smaller. You can convert a 2bit file into a fasta file by running twoBitToFasta from UCSC tools.

twoBitToFastq hg19.2bit hg19.fa # Uncompress the file
bowtie2-build hg19.fa hg19 --threads 5 # Create index with Bowtie2

Step 2: Align your reads to the reference genome.

We got our index and we got our reads, so let’s run the alignment! You can specify different parameters for the alignment (check bowtie2 -h to see them all), we are going to use the following:

  • -t: Print the time it takes to perform each step.
  • -x: Index filename prefix (without the trailing .X.bt2).
  • -U: Single-end reads file (if it were paired end you need to specify the first pair file with -1 and the second pair with -2). Fastq files can be gzipped (extension: .gz), so no need to unzip before aligning!
  • -S: Specify file for SAM output.
bowtie2 -t -x index -U single_end_reads.fastq -S output.sam

Step 3: Explore the output of the alignment

  • Percentage of alignment.
  • File size.
  • File format.

5 Post-processing

Once your alignment is done, you usually will have to perform some routine steps to process the the output:

  • The output is usually is a SAM file. You should convert it to BAM, which is a more light-weight way of storing the same data.
  • You should sort your BAM file by genomic coordinates.
  • You can also filter out some reads you do not want in your BAM file: duplicates, reads mapping to ENCODE black-listed regions, chromosomes you are not interested in, etc.
  • You should index your BAM file. As with reference genomes, you can create a index for your BAM file so operations requiring accessing parts of it will run much faster.

Usually, for running operations on BAM or SAM files, the preferred program is Samtools.

5.1 Converting SAM to BAM

This first step can be easily done using samtools view, with the following arguments:

  • -@ 4: Indicates the number (4) of additional threads to use (default is 0).
  • -b: Tells samtools to convert to BAM.
  • -o: Output filename for the BAM file.
samtools view data/ChIP-seq/BAM/file.sam -@ 4 -b -o data/ChIP-seq/BAM/file.raw.bam

5.2 Sorting your BAM file by coordinate

Most programs need an input BAM file in which reads are sorted according to their genomic position. To order our BAM file we can use samtools sort with the following parameters:

  • -o: Output filename for the sorted BAM file.
  • -m 2G: Tells Samtools to use up to 2G per thread.
  • -@ 4: Uses 4 __additional__threads.
samtools sort data/ChIP-seq/BAM/file.raw.bam -o data/ChIP-seq/BAM/file.srtd.bam -m 2G -@ 4

Do you remember bash pipes? You can use them with samtools too! You can pipe the output of samtools view (a BAM file) to samtools sort adding - , without having to create any intermediate files:

samtools view data/ChIP-seq/BAM/file.sam -b | samtools sort - -o data/ChIP-seq/BAM/file.srtd.bam

5.3 Removing duplicates

  • -r: Remove duplicate reads.
  • -@ 4: Use 4 additional threads.
samtools markdup data/ChIP-seq/BAM/file.srtd.bam data/ChIP-seq/BAM/file.rmdup.bam -r -@ 4

5.4 Indexing your BAM file

samtools index file.rmdup.bam

6 Peak calling

7 Visualization


  1. We will not cover this in this workshop but if you are interested in this, you should check AfterQC